Representing Semantic Relationships in Ancient IE Languages
A Pilot Study
Anton Vinogradov, Gabriel Wallace, and Andrew Byrd
University of Kentucky
2024-10-25
Anton sends his regrets
Dr. Anton Vinogradov, (recent!) PhD in Computer Science
DERBi PIE
Database of Etymological Roots Beginning in PIE
DERBi PIE
- Etymological database, with multiple, linked references
- all of LIV parsed (thanks Thomas Olander!)
- half of Pokorny fully parsed (will finish next July, provided funding)
- will finish parsing NIL this December
- ultimately: everything (?!?)
- We have applied for NEH funding, which we hope will put us closer to the goal of a public release a year from now
DERBi PIE: Query Searches
- Search Functions Created
- Integrated Texts - identify roots, stems, and words in texts
- Phonological Search - identify roots, stems, and words by phonological shape (regex)
- Morphological Search - identify roots, stems, and words by morphological property (POS, class, gender, etc.)
DERBi PIE: Query Searches
- We quickly realized that identifying semantic categories and relationships, especially ones that make sense across languages, is not an easy task
- Given cross-linguistic variation, there is no unified classification of “words”
DERBi PIE: What could this do for us?
DERBi PIE: the Problem
- This is the problem: how on earth can we do this?
WordNet: What is it?
- A large, organized lexical database of English words
- Groups words into “synsets” (sets of synonyms) based on meanings
WordNet: What is it?
- Provides relationships between words, such as:
- Synonyms (similar meanings, e.g. “dog”, “pooch”)
- Antonyms (opposite meanings, e.g. “bad”, “good”)
- Hypernyms (general terms, e.g., “animal” for “dog”)
- Hyponyms (specific terms, e.g., “dog” for “animal”)
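As a rough illustration of how these relations can be represented, here is a minimal, stdlib-only sketch using a toy dictionary of synsets. The identifiers (`dog.n.01`, etc.) mimic WordNet's naming conventions, but the data and helper functions are hypothetical, not the real WordNet API:

```python
# Toy synset table: each synset has member lemmas and hypernym links.
# Illustrative data only; the real WordNet (e.g., via NLTK) is far richer.
synsets = {
    "dog.n.01": {"lemmas": ["dog", "pooch"], "hypernyms": ["animal.n.01"]},
    "animal.n.01": {"lemmas": ["animal"], "hypernyms": []},
}

def hypernyms(synset_id):
    """Return the more general synsets of a synset."""
    return synsets[synset_id]["hypernyms"]

def hyponyms(synset_id):
    """Return the more specific synsets (the inverse of hypernymy)."""
    return [sid for sid, s in synsets.items() if synset_id in s["hypernyms"]]

print(hypernyms("dog.n.01"))    # ['animal.n.01']
print(hyponyms("animal.n.01"))  # ['dog.n.01']
```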
WordNet: What is it? (Kazakov & Dobnik 2003)
WordNet: Linked WordNets for Ancient Indo-European languages (Zanchi & Ginevra)
WordNet: Building One for PIE?
- Working with a team of UK CS undergraduates, we mapped a list of PIE roots and words (primarily from Pokorny) onto the English WordNet structure
- 1500 roots successfully mapped, though many are false matches (‘golf’?)
WordNet: Building One for PIE?
- Numerous substantive difficulties with entries that do not map neatly:
- *aghlu- (IEW, p. 8) ‘dark cloud; rainy weather’: new hyponym (of both ‘cloud’ and ‘weather’) needed
- *ab- (IEW, p. 1) ‘water, river’: new hypernym needed? or two mappings?
- The general idea of English (or any other language) ↦ PIE is difficult to implement, because PIE ≠ English (etc.)!
WordNet: Broader Problems
- Limited Scope of Meanings: Doesn’t capture all nuances of word usage
- Lack of Context: Doesn’t account for how context alters word meaning
- Not All Languages Have WordNet: especially true for ancient/fragmentary languages
- MUST BE DONE MANUALLY
Tactic #2: Reconstructing Word Embeddings using Descendant Languages
Word Embeddings: Overview
Word Embeddings: Hyperspace
Plots like these can be created with generated vectors!
Word Embeddings: Hyperspace…?
If you were to plot out every vector generated from one of these models, you would have a hyperspace or a semantic space, with the dimensions of each word vector essentially acting as coordinates.
The closer two words lie to each other in this space, the closer in semantic value they are.
It’s possible to adjust how many dimensions you generate for each vector, but in general the more the better.
- Fancier math (dimensionality reduction) allows us to ‘simplify’ these vectors down to 2 or 3 dimensions for easy viewing, as seen before
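As a hedged sketch of that ‘fancy math’: the snippet below projects toy 5-dimensional vectors down to 2 using PCA via NumPy’s SVD (one common choice; t-SNE and UMAP are alternatives). The vectors themselves are made-up values for illustration:

```python
import numpy as np

# Toy 5-dimensional "word vectors" (hypothetical values for illustration).
vectors = np.array([
    [0.9, 0.1, 0.3, 0.0, 0.2],   # "dog"
    [0.8, 0.2, 0.4, 0.1, 0.1],   # "cat"
    [0.0, 0.9, 0.1, 0.8, 0.7],   # "run"
])

# Centre the data, then use the SVD to project onto the top two
# principal components (the core of PCA).
centered = vectors - vectors.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
coords_2d = centered @ vt[:2].T  # 2-D coordinates ready for plotting

print(coords_2d.shape)  # (3, 2)
```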
Word Embeddings: Processing the Text for Tokens
Word Embeddings: Processing the Text for Lemmas
Word Embeddings: Identifying the Context
Word Embeddings: Constructing the Hyperspace
- Using a word embeddings model such as word2vec or fastText, run the tokenized and lemmatized text through it to generate word vectors.
- These vectors will be based on the words’ positions within the text.
Word Embeddings: Constructing the Hyperspace
Word Embeddings: Calculate Word Similarities
- To get a numerical representation of the similarity between two vectors, we use their cosine similarity (cosine of the angle between the vectors).
- Similarity scores range from -1 to 1, with -1 indicating opposite vectors and 1 indicating vectors pointing in the same direction
$$\cos\theta = \frac{\mathbf{A}\cdot\mathbf{B}}{\|\mathbf{A}\|\,\|\mathbf{B}\|}$$
The cosine-similarity formula
Word Embeddings: Example of Cosine Similarity
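A minimal worked example of cosine similarity, using hypothetical 3-dimensional vectors (real models use hundreds of dimensions):

```python
import numpy as np

def cosine_similarity(a, b):
    """cos(theta) = (a . b) / (||a|| * ||b||), in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy vectors for illustration.
dog = np.array([1.0, 0.8, 0.1])
cat = np.array([0.9, 0.9, 0.2])
car = np.array([0.1, 0.0, 1.0])

print(cosine_similarity(dog, cat))  # close to 1: semantically near
print(cosine_similarity(dog, car))  # much lower: semantically distant
```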
Word Embeddings: Reconstructing a Hyperspace in Languages without a Corpus
- We can construct a hyperspace for ancient languages;
- But how does one do this for languages without any known corpus, such as PIE?
- Well, how do we identify other properties of PIE?
Word Embeddings : Basic Idea
- We take two similar properties in two related languages, which allows us to approximate an earlier state in the source language
Word Embeddings : Basic Idea
- In the same way, we propose using word embedding models built from descendant languages to approximate an earlier state in the source language
Word Embeddings: Methods
- As you can imagine, this stuff is complicated, which is why we won’t go into much detail about the specific methods;
- see our GitHub for the four-page paper, code, data, etc.
- If there are any questions that we can’t answer, we’ll forward them to our main collaborator, Anton, who will be happy to do so
Word Embeddings, Problem #1: Vectors Across Models
- Vectors generated for hyperspaces take on arbitrary values when training models
- So ‘dog’ could = (0, 0) or (-6, 100) – these values change every time you run the model
- For this reason, we must align models (Dev et al., 2021):
- Identify substructures across language models that remain fixed
- Use pre-aligned word embedding models (following Joulin et al., 2018)
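One standard way to find such a fixed mapping between two embedding spaces is orthogonal Procrustes alignment. The sketch below (a textbook technique, not necessarily the exact method of the cited work) recovers the rotation relating two toy ‘models’ of the same words:

```python
import numpy as np

def procrustes_align(X, Y):
    """Orthogonal matrix Q minimising ||X @ Q - Y||_F (via SVD of X^T Y)."""
    u, _, vt = np.linalg.svd(X.T @ Y)
    return u @ vt

# Two toy "models": Y is X rotated by 90 degrees, so the same words
# get different (but structurally identical) coordinates.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
R = np.array([[0.0, -1.0], [1.0, 0.0]])  # 90-degree rotation
Y = X @ R

Q = procrustes_align(X, Y)
print(np.allclose(X @ Q, Y))  # True: the two spaces are now aligned
```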
Word Embeddings, Problem #2: Verification
- So we take pre-aligned hyperspaces, and reconstruct an earlier, source hyperspace based on the hyperspaces provided
- But how can we trust this methodology?
- Obviously can’t verify the hyperspace of PIE through analysis of PIE texts!
Word Embeddings, Problem #2: Verification
Word Embeddings: Methods
- We use existing aligned models of Spanish & French (Joulin et al., 2018) as a source of vector and word information
- Words filtered out:
- If there is no corresponding word in the other languages
- Non-vocabulary, including words with non-language characters
Word Embeddings: Methods
Models trained on French & Spanish Wikipedia articles include both real vocabulary and non-words/words containing non-language characters; the latter were removed (7.3% and 6.6%, respectively)
Remaining words were lemmatized, further reducing the vocabulary by roughly 10%
Word Embeddings: Methods
- To relate words together and find common words:
- Words are translated into each other’s respective languages via Google Translate, using the Python translation library deep-translator
- The same is done with Latin, using the Latin corpus (from CLTK [the Tesserae Project]), which is lemmatized
- Any word that cannot be lemmatized in Latin (such as Greek words) is removed from the corpus
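The character-based cleanup described above might be sketched like this; the regex is a hypothetical approximation of what counts as a ‘language character’ for Romance vocabularies:

```python
import re

# Hypothetical filter mirroring the cleanup described above: keep only
# tokens made entirely of Latin-script letters (plus apostrophe/hyphen).
VALID = re.compile(r"^[a-zà-ÿœ'-]+$", re.IGNORECASE)

def filter_vocabulary(words):
    """Drop 'non-words': tokens containing digits, URLs, punctuation, etc."""
    return [w for w in words if VALID.match(w)]

print(filter_vocabulary(["chien", "perro", "42", "http://x", "niño"]))
# ['chien', 'perro', 'niño']
```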
Word Embeddings: Calculating *Latin
- If there isn’t a 1:1 correspondence between the Romance language & Latin, identify the centroid of the lemma’s vectors: the language-word center (lwc)
- Identify the centroid of both lwcs -> the inter-language-word center (ilwc)
- Identify the vectors closest to the ilwc using cosine distance
- Take the average of these two vectors to arrive at the approximate *Latin word vector
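The steps above can be sketched as follows, with hypothetical 2-dimensional toy vectors standing in for the aligned Spanish and French embeddings:

```python
import numpy as np

def cosine_distance(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy aligned vectors for one concept: several Romance forms mapping to
# one Latin lemma (hypothetical values for illustration).
spanish = np.array([[1.0, 0.2], [0.8, 0.4]])
french  = np.array([[0.9, 0.3], [1.1, 0.1]])

# Step 1: per-language centroid ("language-word center", lwc).
lwc_es = spanish.mean(axis=0)
lwc_fr = french.mean(axis=0)

# Step 2: centroid of the lwcs -> inter-language-word center (ilwc).
ilwc = (lwc_es + lwc_fr) / 2

# Step 3: closest vector to the ilwc in each language (cosine distance).
closest_es = min(spanish, key=lambda v: cosine_distance(v, ilwc))
closest_fr = min(french, key=lambda v: cosine_distance(v, ilwc))

# Step 4: average the two -> approximate *Latin word vector.
latin_star = (closest_es + closest_fr) / 2
print(latin_star)  # close to [0.95, 0.25] for these toy values
```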
Word Embeddings: Results
- To evaluate the effectiveness of the method, recall that we want to compare the *Latin model with the real Latin one
- This is our “normal” model
Word Embeddings: Analogy vs. OddOneOut
- The analogy task (Mikolov et al. 2013) is considered standard when evaluating word embedding models: London is to England as Paris is to France
- But it doesn’t work well for languages with small corpora (low-resource languages, LRLs), especially ones that aren’t modern
- We follow Stringham & Izbicki 2020 in using the OddOneOut task, which has been demonstrated to be more accurate in these situations
- OddOneOut task demonstrated to work for corpora as small as 1800 tokens (Old Gujarati)
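A minimal sketch of the OddOneOut idea: score each word by its mean similarity to the rest of the set and flag the lowest. The toy vectors are hypothetical, and the real task (Stringham & Izbicki 2020) aggregates many such trials:

```python
import numpy as np

def cos(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def odd_one_out(vectors):
    """Index of the vector with the lowest mean similarity to the rest."""
    n = len(vectors)
    means = [np.mean([cos(vectors[i], vectors[j])
                      for j in range(n) if j != i]) for i in range(n)]
    return int(np.argmin(means))

# Hypothetical toy vectors: three 'animal-like' words, one 'vehicle-like'.
words = ["dog", "cat", "horse", "car"]
vecs = [np.array([1.0, 0.1]), np.array([0.9, 0.2]),
        np.array([0.8, 0.3]), np.array([0.1, 1.0])]

print(words[odd_one_out(vecs)])  # car
```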
Word Embeddings: Results
- Our results indicate mixed success: we can create a hyperspace from the descendant languages that performs reasonably well on Latin tests with OddOneOut
- Preliminary tests show that the smaller the corpus, the better the model works, as compared to the normal one
- This is not exactly a bad thing, as this is the case for many languages in our discipline!
- If a language has a large corpus (such as Ancient Greek), we can reduce it as we’ve done here for Latin
Word Embeddings: Results
- Unclear what the “magic number” is for corpus size to arrive at the most accurate representation (compared to the normal model)
- Unclear exactly why the performance of descendant models tends to decrease as corpus size increases
- Anton has ideas, all quite technical – see the draft for further discussion
Word Embeddings: Problems with Current Model
- Use of Google Translate is suboptimal and may result in translation errors; we should use either bilingual dictionaries or LLMs (think GPT) for more accurate translations
- LLMs > Word Embeddings
- Vectors: polysemy (e.g., ‘bank’)
- Vectors: context
- Vectors: precision (300 vs. 175B parameters)
Recap & Future Directions
Recap: WordNet
- Upsides: semi-universal structure
- Downsides:
- must be done manually; requires scholars to make choices that are sometimes unknowable;
- isn’t capable of showing certain types of semantic similarities/differences beyond synonymy, hyponymy, etc.
Recap: Word Embeddings
- Upsides: fully automated; requires little processing power
- Downsides: doesn’t distinguish multiple senses (polysemy), is less accurate than LLMs
Future Directions: WordNet Embedding?
Johansson and Nieto Piña 2015: build hyperspaces from systems like WordNet
We should be able to do this for many IE languages (mostly modern)
For languages without WordNets (like PIE), we “translate” the lexicon (< DERBi PIE) into a WordNet structure
Eliminate any matchings that are untrue (like *i̯eh₂- “drive” = “hit a golfball”)
- Manually assign outliers as hyponyms, hypernyms, synonyms, etc. of existing lexemes
But same problems from before remain
Future Directions: using LLM models
Download: Slides, Paper, Code